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The heterogeneous variational nodal method (HVNM) has emerged as a potential approach for solving high- 
fidelity neutron transport problems. However, achieving accurate results with HVNM in large-scale problems 
using high-fidelity models has been challenging due to the prohibitive computational costs. This paper presents 
an efficient parallel algorithm tailored for HVNM based on the Message Passing Interface standard. The algo- 
rithm evenly distributes the response matrix sets among processors during the matrix formation process, thus 
enabling independent construction without communication. Once the formation tasks are completed, a collec- 
tive operation merges and shares the matrix sets among the processors. For the solution process, the problem 
domain is decomposed into subdomains assigned to specific processors, and the red-black Gauss-Seidel itera- 
tion is employed within each subdomain to solve the response matrix equation. Point-to-point communication is 
conducted between adjacent subdomains to exchange data along the boundaries. The accuracy and efficiency of 
the parallel algorithm are verified using the KAIST and JRR-3 test cases. Numerical results obtained with mul- 
tiple processors agree well with those obtained from Monte Carlo calculations. The parallelization of HVNM 
results in eigenvalue errors of 31 pcm/-90 pcm and fission rate RMS errors of 1.22%/0.66%, respectively, for 
the 3D KAIST problem and the 3D JRR-3 problem. In addition, the parallel algorithm significantly reduces 
computation time, with an efficiency of 68.51% using 36 processors in the KAIST problem and 77.14% using 


144 processors in the JRR-3 problem. 
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I. INTRODUCTION 


The solution of the neutron transport equation plays a piv- 
otal role in the analysis of neutron distribution in a nuclear 
system. In recent years, with the advancements in com- 
putational resources, the one-step neutron transport method 
with homogenization eliminated has garnered increasing at- 
tention as a prominent research focus. The Method of Char- 
acteristics (MOC) [1, 2] has been identified as a promis- 
ing method for one-step whole-core neutronics calculation. 
The fundamental idea behind this method is to generate a 
set of parallel rays for each discretized angle and solve the 
one-dimensional (1D) neutron transport equation along these 
rays. However, applying MOC directly to three-dimensional 
(3D) whole-core domains leads to prohibitively high com- 
putational costs. Therefore, a common practice is to em- 
ploy the two-dimensional/one-dimensional (2D/1D) approxi- 
mation, known as 2D/1D-MOC [3-5]. In 2D/1D-MOC, the 
coupling of 2D MOC calculation in the lateral plane with the 
diffusion or transport calculation in the axial direction strikes 
an optimal balance between accuracy and computational 
costs. Several neutronic codes based on this method have 
been developed, including MPACT [3], PROTEUS-MOC [6], 
PANDAS-MOC [7], NECP-X [8, 9] and SHARK [10]. How- 
ever, 2D/1D-MOC still faces challenges, such as the com- 
plexity of the coupling strategy between 2D and 1D calcula- 
tion and potential convergence issues when refining the axial 
mesh [11, 12]. 
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The variational nodal method (VNM) offers another option 
for one-step whole-core neutronics calculations. This method 
utilizes the functional for second-order even-parity transport 
equation, with odd-parity Lagrange multipliers employed to 
enforce nodal balance. Response matrices (RMs) are ob- 
tained using a classical Ritz procedure. The VNM was first 
proposed in the 1980s and initially applied to homogenous 
node problems [13]. Over the years, VNM-based codes, such 
as VARIANT [14, 15] and VITAS [16-24], have emerged, 
benefiting from its accuracy and adaptability to mesh geome- 
try. Since 1997, the VNM has expanded its capability to han- 
dle heterogeneous materials within the nodes, enabling high- 
fidelity neutronics calculations. 


In 2017, a significant milestone was reached with the devel- 
opment of the 3D Heterogeneous Variational Nodal Method 
(HVNM), specifically designed for pin-resolved problems. 
This method, implemented in PANX [25, 26] and VITAS [17, 
21], treats each pin cell as a single node and utilizes iso- 
parametric finite element to accurately represent the pin- 
resolved geometry. Angular expansion is achieved using 
spherical harmonics, while radial and axial leakage expan- 
sion employ polynomials and piece-wise constants, respec- 
tively. HVNM directly performs full 3D calculations without 
coupling calculations between 2D and 1D domains, as seen 
in 2D/1D-MOC. Therefore, HVNM avoids lateral integration 
and eliminates issues associated with negative leakage terms. 
Recent research [27] has compared the accuracy and effi- 
ciency of HVNM and 2D/1D-MOC in pin-resolved problems. 
It was reported that for the KAIST problem, the NuScale 
problem, and the Beavrs problem, HVNM produces a more 
accurate pin power distribution and superior computational 
efficiency compared to 2D/1D-MOC [27]. This demonstrates 
the significant potential of HVNM as an alternative option to 
2D/1D-MOC for one-step neutronics calculation. 
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In our previous publication [17], we introduced and ver- 
ified the high-fidelity modeling capability of HVNM using 
the C5G7 benchmark problem set. It is worth noting that the 
previous verifications were limited to relatively small-scale 
pin-cell geometry cases. Therefore, further verification of 
HVNM is necessary to comprehensively investigate its fea- 
sibility for larger problems. In addition, it is crucial to ex- 
amine whether the method can be applied to problems with 
fine mesh sizes, such as plate-type assemblies with fuel plates 
at the millimeter scale. Unfortunately, the limited serial ca- 
pability of HVNM has hindered its ability to achieve suffi- 
cient space-angle orders for the desired accuracy when deal- 
ing with strong heterogeneous problems or to calculate prob- 
lems significantly larger than those of the C5G7 benchmark 
problem. Consequently, there is an urgent need for research 
on parallel algorithms for HVNM. 


Prior to this work, significant efforts have been devoted to 
the development of parallel algorithms for the VNM. Sev- 
eral parallel strategies have been proposed, including a par- 
allel approach based on the Message Passing Interface (MPI) 
standard implemented in VARIANT. However, the existing 
parallel implementation of the VNM was only devoted to the 
axial planes [28], limiting its applicability to 3D problems. 
Another parallel approach [29] based on non-overlapping do- 
main decomposition has been investigated for the solution 
of a red-black algorithm; however, its restriction to regular- 
shaped finite elements hinders its effectiveness in addressing 
heterogeneous problems. Furthermore, a hybrid paralleliza- 
tion of HVNM for pin-resolved neutron transport calculations 
has been presented by Wang et al [30]. However, the study 
lacks a detailed analysis of parallel efficiency and is confined 
to Pressurized Water Reactors. These limitations highlight 
the research gap that still exists in developing a comprehen- 
sive and efficient parallel algorithm, which is specifically tai- 
lored for HVNM, capable of addressing the challenges posed 
by intense heterogeneity and large-scale neutron transport 
problems. Therefore, this work aims to fill this research gap 
by proposing an efficient parallel algorithm for HVNM and 
conducting a thorough analysis of its parallel efficiency. 


In this study, we propose a parallel formulation specifically 
tailored for HVNM within an MPI framework. Consider- 
ing HVNM as a representative RM method, the procedure of 
HVNM is divided into two steps: (a) constructing the RMs 
and (b) solving the resulting matrix equations. In step (a), 
each RM is constructed independently, which inherently al- 
lows for parallelism. Therefore, we employ a specialized par- 
allel strategy, rather than domain decomposition, for RM for- 
mation to ensure optimal load balance. This approach evenly 
distributes the computational workload among MPI proces- 
sors, optimizing the parallel performance. The solution pro- 
cess is parallelized through non-overlapping domain decom- 
position. The entire space domain is divided into multiple 
subdomains, with each subdomain assigned to an MPI pro- 
cessor. The subdomains are coupled through interface nodes 
located along their boundaries. 

The remainder of this paper is organized as follows. Sec- 
tion II introduces the theoretical models of HVNM and the 
parallel algorithm. In Section II, numerical results for rep- 
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resentative heterogenous neutron transport problems, KAIST 
and JRR-3, are obtained to verify the accuracy and perfor- 
mance of the parallel algorithm. The parallel performance 
is evaluated by comparing the CPU time between serial and 
multi-core parallel computations. Finally, Section IV con- 
cludes the paper and discusses possible future improvements. 


Il. THEORETICAL DESCRIPTIONS 


A. Theoretical models for neutron transport 


This section provides the essential equations for HVNM, 
but for a comprehensive understanding of the derivation pro- 
cess and detailed matrix expressions, please refer to Ref. [25]. 
HVNM is based on the second-order neutron transport equa- 


tion (NTE) with isotropic scattering approximation. The 
second-order NTE within the group takes the form of 
-9. VE H (r)Q . Vet (r, Q) m 


+ Xi (r)pt (r,Q) = £.(r)olr) + g(r), 


where X, (r) and £, (r) are the macroscopic total and scatter- 
ing cross-sections, respectively. wt (r, Q) is the even-parity 
angular flux at position r in direction Q. ¢(r) is the scalar 
flux satisfying (r) = f y(r, Q)dQ. q(r) is the group source 
consisting of scattering and fission terms: 


ar) = X. Esg (r)óg (r) + kage Xg DUE zo (7) bq'(r); 
FG g' (2) 


In addition, the odd-parity angular flux %7 (r,Q) is defined 
and satisfies 
wo (r, Q) = -E7 (r)Q - Vet (vr, Q) (3) 


In HVNM, the second-order NTE is formulated as a varia- 
tional principle in terms of a global functional F[Y+, Y7] 


Fit y] = X Folyt, yT], (4) 


which is a superposition of the functional for each node, 


Folyt, y7]: 
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In summary, the spatial and angular independent variables r 
and Q are suppressed. In local coordinates, dV = dxdydz 
with —Az/2 < x < Aa/2, —Ay/2 < y < Ay/2, 


153 
154 
155 
15 


D 


157 


158 


159 


16 


6 


16 


162 


163 


164 


165 


166 


16 


g 


16 


© 


16 


© 


17 


=] 


17 


172 


173 


174 


175 


176 


177 
178 
179 
180 


18 


182 
183 
184 
185 
186 
187 


188 


189 


190 
191 
192 


193 


—Az/2 < z < Az/2. n, is the outward normal to the lat- 
eral interfaces extending over the periphery I’, while n,, and 
m,_ are the outward normal to the top and bottom axial in- 
terfaces, respectively. 

Within the node, the even-parity angular flux is expanded 
as 


Ca a 2) = f7 (2) ® g” (x, yy (9N), 


where f(z) and g(x, y) are vectors of orthonormal polynomi- 
als and continuous finite-element trial functions, respectively. 
@ represents a tensor product. y(Q) is a vector of expansion 
moments with respect to Q . Correspondingly, the scalar flux 
is expanded as 


(6) 


olr) = f7 (2) 8 g7 (t, y)ẹ, (7) 
where œ is a vector of scalar flux moments, satisfying @ = 
f dQ2p(Q). It’s worth noting that the radial flux distribu- 
tion within the node is represented by continuous, piecewise 
finite-element functions. This treatment allows for the dis- 
continuities in cross-sections at the finite-element interfaces 
within each node, thereby eliminating the requirement for ho- 
mogeneous nodes. 
The odd-parity angular flux is expanded as 


Wy (r, Q) = y? (Q) @h" (2, y)x, (8) 


and 


b; (7,2) = [f7 (2) @ £7 (6) @ 9; (Q)x, 


(9) 
Y=Uy ESY, 

on the axial and lateral interfaces, respectively. h(x, y) de- 
notes a piecewise constant vector, with each of its compo- 
nents equal to one over the domain of one or more finite ele- 
ments and zero elsewhere. y,(QQ) and y4 (Q) are vectors con- 
sisting of odd-parity spherical harmonics defined on the axial 
and lateral interfaces, respectively. X, and x are expansion 
moment vectors. The material interfaces within a single node 
can be explicitly described using these trial functions, ensur- 
ing that there is no smearing between the materials at axial 
interfaces. 

Inserting Eq. (6) through Eq. (9) into Eq. (5) results in the 
discretized functional in the form of 


F,(W(®), xx] = | dW (2) AYN) 
— pTI, ® Fep — 26"q 
+25 f ary" (HE,(M)x, 1 
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Requiring the discretized functional given in Eq. (10) to be 
stationary with respect to variation in Y(Q), x, and v-, and 
employing the linear transformation of variables, finally re- 
sults in the following equations: 
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j’ =Bq+ Rj” (11) 


=Vq- C(t- i) (12) 
where 7* and j~ stand for the vectors of the expansion 
moments of outgoing and incoming partial currents along 
the nodal surfaces, respectively. B, R, V, and C are the 
nodal RMs, which are coefficient matrices solely related to 
the nodal geometry and macroscopic cross-sections. Eq. (11) 
signifies the relationship between the neutron source within 
the node and the partial current on the node’s surface, while 
Eq. (12) represents the neutron conservation within the node. 

The numerical solution process in HVNM involves three 
levels of iteration [27]. The outermost iteration is the Fission- 
Source (FS) iteration, which utilizes the Power Method [31]. 
In each FS iteration, if up-scattering is present, the multi- 
group (MG) flux system is solved using the legacy Gauss- 
Seidel (GS) algorithm, referred to as the MG iteration. How- 
ever, if there is no up-scattering, only a single sweep over 
the energy groups is required. Within each energy group, 
the within-group (WG) RM system, expressed by Eq. (11), is 
solved using the Red-Black Gauss-Seidel (RBGS) algorithm, 
referred to as WG iteration. The detailed solution process is 
presented in Algorithm 1. 


Algorithm 1 The HVNM iteration process 

1: Initialize angular fluxes, partial currents and eigenvalue. 

2: Initialize fission source. 

3: FS iteration: Do n = 1, Nmax 

Calculate total source term. 

q= 2 Zsgg'Pg' + kag Xs 2 VE fg Pag! 
a #9 g 

MG iteration: Do m = 1, Mmax 

Energy group sweep: Do g = 1, G 

Calculate source term within group g. 

9: WG iteration: Do i = 1, Imax 


10: Solve jt = Bq + Rj” using RBGS algorithm. 
11: IF j* converged, EXIT 

12: End Do WG iteration 

13: Update scalar flux using @ = Vq — C(j* — j`) 
14: End Do Energy group sweep 

15: If @ converged, EXIT 

16: End Do MG iteration 

17: Update q and keg. 

18: Ifq and keg converged, EXIT 


19: End Do FS iteration 


Several techniques, specifically tailored for HVNM, in- 
cluding the flat source region (FSR) acceleration method [25], 
partitioned matrix (PM) method [32], and quasi-reflected in- 
terface condition (QRIC) method [26], are employed to accel- 
erate the solution process. These acceleration methods have 
been elaborated in our previous publications [21] and thus are 
not described in detail in this paper. 

The FSR acceleration method aims to reduce the degrees 
of freedom within the node by partitioning the finite elements 
into FSRs. Within each FSR, the group source at each finite- 
element vertex is approximated as the average source within 
that FSR. 
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The PM method involves decomposing the response matri- 
ces into a low-order matrix corresponding to the surfaces of 
each node and a high-order spatial-angular matrix. The high- 
order terms are used to construct a correction source term for 
solving the low-order diffusion matrix equation during the it- 
eration process. 

The QRIC method aims to reduce the number of angular 
degrees of freedom on the interfaces by applying the reflective 
boundary condition (B.C.) to the high-order angular terms. 
This reduction leads to a smaller size of the response matrix, 
resulting in improved computational efficiency and reduced 
memory requirements. 


B. Parallel algorithm 


The parallel algorithm tailored for HVNM is based on MPI. 
In the subsequent sections, we introduce the parallel algo- 
rithms for matrix formation and solution. Although HVNM 
incorporates acceleration methods such as PM, FSR, and 
QRIC, it is not necessary to consider the parallelization of 
these acceleration methods themselves. The parallel algo- 
rithm described in the following sections is fully compatible 
with these acceleration techniques. 


1. Matrix formation parallel algorithm 


According to the expressions of RMs (i.e., B, R, V and 
C designated as a matrix set), they are purely dependent on 
the node’s geometry and macroscopic cross-sections. This 
implies that for a specific energy group, nodes with the same 
geometry, material, and finite element grid (categorized as a 
unique node) will have identical matrix sets. Therefore, the 
formation of matrix sets is an independent operation for each 
unique node and energy group; this independence allows for 
perfect scalability in a parallel computing environment using 
the MPI framework. Each MPI processor can construct ma- 
trix sets for a subset of unique nodes and energy groups si- 
multaneously, without any communications. 

The most straightforward and intuitive parallel scheme is 
to evenly assign the matrix formation tasks to all the pro- 
cessors to achieve optimal load balance. Assuming there are 
NG energy groups and NU unique nodes, a total of NM = 
NGxNU matrix sets need to be constructed. The formation 
of NM matrix sets is partitioned by N P processors so that 
each processor undertakes a part of the calculation simultane- 
ously. If NM is exactly divisible by N P, the index of matrix 
sets to be calculated on the processor p (p € [0, NP — 1)) 
can be defined as i, € [p AM +1, (p+ 1). AHL], How- 
ever, in cases where NM cannot be evenly divided by NP, 
the bounds of 7,, need to be adjusted to allocate the remaining 
matrix sets to specific processors. Fig. | illustrates a parti- 
tion example with NU = 2 and NG = 4. When NP = 2, the 
matrix sets are evenly distributed among 2 processors with 
each processor being assigned 4 matrix sets. When NP = 3, 
Processor 0 and Processor 1 are assigned 3 matrix sets while 
Processor 2 is assigned 2 matrix sets. The partition scheme 
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enforces that the number of matrix sets assigned to each pro- 
cessor is as balanced as possible. 


Matrix set index 1 2 3 4 5 6 7 8 
Matrix set | (1,1) | a| | a,3)} | a4) | (2,1) (2, 2) (2,3) (2,4) 
Processor 0 ( Processor 1 
phate a, 1) (1, 2) | (1, 3) | (1, 4) (2,1) (2,1) (2, 3) (2, 4) 
N 
Va 
Processor 0 Processor 1 Processor 2 

NP=3 a, 1) (1, 2) | (1, 3) d,4) | | 2,1) (2,1) (2,3) (2, 4) 

a represents the matrix set corresponding to unique 

[8 node ‘u’ and energy group ‘g’. 


Fig. 1. (Color online) Matrix sets assignment for each processor. 


Each processor requires the corresponding set of RMs for 
its subdomain during the solution of the matrix equations 
presented in Eqs. (11) and (12). However, the distribution 
scheme of matrix sets may result in some processors not hav- 
ing the required response matrix sets locally. Instead, these 
matrix sets are allocated to other processors for construction. 
In such cases, communication between processors is neces- 
sary to ensure that each processor obtains the required re- 
sponse matrix sets for its subdomain. The communication 
scheme employed in this study involves transmitting the lo- 
cal matrix sets constructed by each processor to a designated 
processor, which preforms the merging process to generate a 
global matrix set encompassing all unique nodes and energy 
groups. Finally, the global matrix set is dispatched to all other 
processors. 

The parallelization for matrix formation is outlined in Al- 
gorithm 2, which highlights the steps involved in distributing 
the matrix formation tasks and preforming the necessary com- 
munication to generate the global matrix set. While no com- 
munication is required between processors during the calcula- 
tion of matrix sets, load imbalances may occur if the number 
of matrix sets cannot be evenly distributed among the pro- 
cessors. In addition, the collective manipulations required to 
generate the global matrix set and transfer it to each proces- 
sor introduce communication overhead, which can impact the 
parallel performance. The communication overhead is mainly 
influenced by both the number of processors involved in the 
communication and the size of the matrices that need to be 
communicated. The communication overhead becomes more 
significant as the number of processors and the size of the 
matrices increase. 


2. Solution parallel algorithm 


In the parallelization of the solution process in HVNM, 
non-overlapping domain decomposition is employed. The en- 
tire space domain is divided into multiple subdomains, with 
each subdomain assigned to an MPI processor. Examples 
of 3D non-overlapping domain decomposition are shown in 


Algorithm 2 Parallelization for matrix formation 
1: MPI Initialization 


3 

4: Nrm = MOD(NM, NP) 

5: If (Nrm = 0) Then 

6: NM, =NM/NP forpé€ [0,NP-— 1] 

7: Else 

8: NM, = int(NM/NP)+1 for p € [0, Nrm — 1] 
9: NM, = int(NM/NP) for p € [Nrm, NP — 1] 
10: End if 

11: Calculate start index and end index of matrix set. 


r—2 
: Mstartr = X, NMp +1 
p=0 
r-1 


13: Mendr = X, NM, 
p= 


14: 
15: 
16: 
17: 
18: 


0 

Calculate start and end index of unique node. 
Calculate start and end index of energy group. 
Calculate Local matrix sets on each processor. 
Do UniqeNode = U start, Uend 

Do Group = Gstart, Gend 
19: Calculate LocalMatrixSets(UniqueNode, Group) 
20: End Do 
21: End Do 
22: Processor 0 gathers all LocalMatrixSets from other processors 
and store them in GlobalMatrixSets. 
23: Processor 0 transfers GlobalMatrixSets to each processor. 
24: we cess 
25: MPI Finish 


32 Fig. 2. The subdomains are coupled through interface nodes 
321 located along their edges. The primary challenge in paral- 
322 lelization lies in identifying the processes that require parallel 
323 communication and determining the effective way to imple- 
324 Ment it. 


X-direction 
domain decomposition 


Y-direction r 
domain decomposition 


* Z-direction t 
domain decomposition 


Fig. 2. (Color online) 3D non-overlapping domain decomposition. 


The iteration process of HVNM, mentioned in Section II A, 
reveals that the update of the eigenvalue in the FS iteration 
and the solution of the RM equation in the WG iteration 
require data transfer between subdomains. The eigenvalue 
updates can be parallelized through collective manipulation. 
The designated processor gathers the individual contributions 
to the total fission source from all processors, computes the 
next estimate of the eigenvalue, and broadcasts this value to 
all other processors. This ensures that all processors have 
consistent and updated eigenvalue estimates. 

Conversely, the parallelization for the solution of the RM 
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336 equation is more complex. When solving the RM equation, 
337 the global nodes are colored red and black, ensuring that ad- 
338 jacent nodes have different colors. Fig. 3 shows the red-black 
coloring scheme in a 2D domain. Based on the principle of 
continuity, it can be deduced that the incoming partial current 
on a surface of the red node is equal to the outgoing partial 
current on the same surface of the adjoining black node and 
vice versa. This equality relationship is applied to update the 
4 incoming partial current, while the RM equation is used to up- 
date the outgoing partial current. Obviously, the data transfer 
34 between subdomains is necessary when updating the partial 
347 Current defined across the boundaries of subdomains. A sim- 
ss ple illustration of the data transfer is presented in Fig. 4. 
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345 


Fig. 3. (Color online) Red-black coloring scheme in a 2D domain. 


In each subdomain, the partial currents are first updated 
through a loop over nodes in the order of red nodes followed 
by black nodes. Once a sweep of all red nodes or all black 
nodes is completed, the two adjacent subdomains engage in 
simultaneous point-to-point communication to exchange par- 
tial currents on each boundary, as illustrated in Fig. 4. In 
355 Fig. 4, the yellow arrows indicate the direction of data trans- 
356 fer after solving all red nodes, while the green arrows indicate 
the direction of data transfer after solving all black nodes. 
When the partial currents of all red nodes have been updated, 
each subdomain will engage in the exchange of updated par- 
tial currents of red nodes with its neighboring subdomains on 
the boundaries. Subsequently, each subdomain will utilize 
the received partial currents to update the partial currents of 
the black nodes on the boundaries. Likewise, parallel com- 
munication follows a similar process after updating all black 
nodes. The parallel algorithm for the solution process pre- 
serves the benefits of RBGS, ensuring that the incoming par- 
tial current used for updating the outgoing partial current on 
the subdomain’s boundaries is the most up to date. Thus, the 
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39 parallel algorithm ensures a satisfactory convergence speed 
370 When solving the WG response matrix equation. During each 
371 WG iteration, the number of point-to-point communications 
372 is equal to twice the number of adjoining subdomain bound- 
373 aries. Taking Fig. 4 as an example, one WG iteration needs 2 
374 X 4 point-to-point communications. At the end of all nodes 
sweep, the designated processor gathers the individual con- 
376 tributions to the iteration error from all processors, computes 
377 the final iteration error, and broadcasts this value to all other 
37s processors. If the partial currents satisfy the convergence cri- 
379 terion, the WG iteration will be terminated. The detailed par- 
330 allelization for WG solution is shown in Algorithm 3. 
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Fig. 4. (Color online) Data transfer between subdomains in the par- 
allelization for WG solution. 


31 In the parallel algorithm for solution process, there are 
382 three factors that can affect the parallel performance. First, 
383 local workload imbalances may occur if the subdomains have 
384 an unequal number of nodes (referred to as local nodes), 
which can lead to load imbalances among processors. Sec- 
336 ond, the ratio of communication effort to local work also in- 
387 creases as the number of subdomains increases. In summary, 
388 the communication overhead becomes more significant com- 
389 pared to the computational workload, which can have a nega- 
390 tive impact on the efficiency of the parallel algorithm. Third, 
31 communication imbalances may arise when subdomains have 
392 different numbers of communicated boundaries. Subdomains 
393 located in the middle of the problem typically have more ad- 
334 jacent subdomains to communicate with compared to those 
35 on the surfaces. This imbalance may also affect the parallel 
3% performance. 
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Algorithm 3 Parallelization for WG solution 

1: MPI Initialization 

2s seu 

3: LocalNodes is total number of nodes in subdomain. 
4: NumRed is number of red nodes in subdomain. 
WG iteration: Do i = 1, Imax 

6: Loop over nodes in each subdomain. 

7: Node sweep: Do n = 1, LocalNodes 

8 Update iz of each nodes. 


10: = Sn + Rj, 

11: it = = NumRed) Then 

12: Send j; of red nodes on boundaries to adjacent 
processors. 

13: Receive j,* of red nodes on boundaries from adjacent 
processors. 

14: Else if (n = LocalNodes) Then 

15: Send j} of black nodes on boundaries to adjacent 
processors. 

16: Receive j} of black nodes on boundaries from 
adjacent processors. 

17: End If 

18: End Do Node sweep 

19: Collect iteration error from all processors. 

20: Ifj} coverged, EXIT 


21: End Do WG iteration 
QD eases 
23: MPI Finish 


Ill. RESULTS AND DISCUSSION 


The foregoing parallel algorithm has been implemented 
through a revision of the VITAS code. In this section, the 
accuracy and performance of the algorithm are evaluated us- 
ing the KAIST problem [27] and the JRR-3 problem [33, 34]. 
These problems represent challenging scenarios in terms of 
computational requirements and spatial heterogeneity, mak- 
ing them suitable for assessing the performance of the parallel 
algorithm. It is worth noting that applying HVNM to plate- 
type assemblies in the JRR-3 problem involves modeling the 
internal structure of the reactor with mm-level grids, which 
poses significant challenges and represents the first attempt at 
applying this method to such reactors. Furthermore, the use 
of JRR-3 problem for verification and analysis of the paral- 
lel algorithm underscores the appropriateness of the proposed 
parallel algorithm in tackling various types of heterogeneous 
transport problems. 

The evaluation of parallel performance is measured using 
speedup (S) and efficiency (€). Speedup is defined as the ra- 
tio of the sequential run time (7), estimated using the run 
time with one processor (Tı), to the parallel run time when 
using P processors (Tp). Efficiency measures the utilization 
of resources in the parallel system. 


T; Ti 
Dy = Sa 
Tp T, (13) 
P 
Ep = — 


41- All the following computations were performed on the PI 
42 2.0 cluster supported by the Center for High Performance 
423 Computing at Shanghai Jiao Tong University. The PI 2.0 
cluster consists of 654 compute nodes. Each compute node 
is equipped with two Intel Xeon Scalable Cascade Lake 6248 


CPUs @ 2.5GHz, with each CPU having 20 cores. 
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A. KAIST problems 


427 


The KAIST problem is derived from the MOX benchmark 
problem 2A proposed by KAIST in South Korea. It repre- 
sents a simplified model of a light-water reactor with 52 fuel 
assemblies surrounded by a water reflector. The problem is 
simplified to an 1/8 core by applying reflective B.C. on the 
south, west and bottom sides of the core for reducing compu- 
tational complexity. The lateral geometry of the eighth core is 
illustrated in Fig. 5, including three types of fuel assemblies: 
UOX-1, UOX-2, and MOX-1. Each assembly consists of 289 
pin cells arranged in a 17x17 pin layout. The UOX-1 and 
UOX-2 assemblies comprise UO% pin cells with enrichment 
of 2.0% and 3.3%, respectively. The MOX-1 assembly con- 
tains three different types of MOX pin cells with enrichment 
of 4.3%, 7.0%, and 8.7%. The geometry of each pin cell is 
illustrated in the upper right corner of Fig. 5, where the cir- 
43 Cle area can represent fuel, moderator, or control rod, while 
444 the area between circle and square represents moderator. The 
445 height of core is 150 cm with 15 cm reflectors on the top. 
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2.0% UO, 
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Vacuum 


Reflective 


4.3% MOX 


7.0% MOX 
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Guide Tube /Control Rod 


Moderator 


Reflective 


Fig. 5. (Color online) The lateral layout of the KAIST problem. 


446 
447 
48 The calculations employ seven group macroscopic cross- 
sections, which can be found in Ref. [27]. Each pin cell is 
treated as one node in radial, and the whole problem is evenly 
divided into 10 layers in axial. Thus, there are 85 x 85 x 10 
nodes in the problem. Each node has the dimension of 1.26 
453 cm X 1.26 cm x 15 cm. The fuel pin cells are meshed using 
454 five radial rings for the fuel zone, one radial ring for the mod- 
455 erator zone, and eight azimuthal sectors. Each fuel pin cell 
e Comprises 48 quadratic finite elements, as shown in Fig. 6. 
4s7 The meshing scheme for control rod pin cells and guide tube 
458 pin cells follows the same pattern as the fuel pin cells, with the 
4s9 Only difference being the replacement of fuel material with 
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the corresponding control rod material or guide tube mate- 
rial. The FSR acceleration method is employed to accelerate 
the calculations, treating each finite element as one FSR. We 
specify 48 quadratic x-y finite elements in each node, using 
2" order polynomials in the axial direction. On the lateral 
and axial interfaces, 2"" order polynomials and 48 piecewise 
constants are employed, respectively. Angular integrals are 
evaluated utilizing a 25 x 25 Square Legendre-Chebyshev 
(SLC) cubature. On the nodal interfaces, Py n expansions are 
employed where P„ represents the approximations on the in- 
terface after applying QRIC method to eliminate high-order 
angular moments from n+1 through N. Table 1 summarizes 
detailed calculation settings for the KAIST problem, includ- 
ing the expansion orders, the convergence tolerance, and the 
applied acceleration methods. The sensitive analysis for the 
spatial and angular expansion order indicates that this set of 
discretization schemes are adequate to eliminate the errors 
associated with the spatial and angular approximations. For 
az brevity, the detail of sensitive analysis is omitted in this paper. 
479 
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Fig. 6. (Color online) The quadratic finite element grid for three pin 
cells in the KAIST problem. 
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Table 1. Calculation settings for the KAIST problem. 


Value 

48 Quadratic x-y finite elements 
2" order polynomials 

2" order polynomials 

48 piecewise constants 

25 x 25 SLC cubature 

P23 3 


Calculation parameters 

Volume spatial expansion in x-y 
Volume spatial expansion in z 
Surface spatial expansion in x/y 
Surface spatial expansion in z 
Volume angular integrals 

Py order on the lateral interfaces 


Py order on the axial interfaces P3 ı 
Fission source tolerance 5.0 x 107° 
Eigenvalue tolerance 1.0 x 107° 
Flux tolerance 1.0 x 1075 
Tolerance for WG iteration 1.0 x 1077 
FSR acceleration YES 
PM acceleration YES 
QRIC acceleration YES 


482 


483 


484 1. Accuracy comparison 


48s We performed both serial and parallel computations using 
e multiple MPI processors to verify the accuracy of the parallel 
7 algorithm. As a comparison, the numerical results are com- 


pared with those obtained from MG Monte Carlo (MC) calcu- 
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lation, which served as the reference solution. In the MC cal- 
culation using MCNP, a simulation of 5 million particles per 
batch was performed, with a total of 500 batches, of which 
300 batches were skipped. The large amount of particles was 
sufficient for an accurate simulation, and the statistical devia- 
tion of the eigenvalue was 3 pem. 

Table 2 presents the comparison results of the eigenvalue 
and axially integrated pin fission rate, including eigenvalue 
error, maximum fission rate percent error (MAX), average fis- 
sion rate percent error (AVG), and root mean square (RMS) of 
the fission rate percent error. Table 2 only presents the results 
obtained from parallel computations using 48 processors, for 
the sake of brevity, due to obtaining identical results with dif- 
ferent numbers of parallel processors. From Table 2, it is ob- 
served that the parallel calculation yields the same eigenval- 
ues and fission rate as the serial code: the eigenvalue error 
is 31 pcm, while the RMS of the fission rate percent error is 
1.22%; this demonstrates the correct implementation of the 
parallel algorithm. The normalized fission rate distribution 
is depicted in Fig. 7(a). It can be observed that sharp power 
gradients emerge throughout the core, with the power peak 
positioned at the interface between the MOX-1 and UOX-1 
assemblies. Fig. 7(b) shows the percent error distribution of 
the fission rate. 


Table 2. Comparison of results for the KAIST problem. 


Method HVNM MC (Ref.) 

Number of processors 1 48 / 

Eigenvalue 1.14395 1.14395 1.14364 

Eigenvalue error (pcm) 31 31 / 
RMS 1.22 1.22 / 

Pin fission rate (%) MAX 4.49 4.49 / 
AVG 0.83 0.83 / 


(a) Normalized fission rate distribution 


(b) Fission rate percent error distribution/% 


Fig. 7. (Color online) The normalized fission rate distribution and 
percent error distribution for the KAIST problem. 


2. Parallel performance analysis 


This section focuses on analyzing the parallel performance 
of the parallel algorithm using two metrics: speedup and par- 
allel efficiency, to assess its effectiveness. The speedup mea- 
sures the extent to which parallel computation is faster than 
its sequential counterpart. The parallel efficiency assesses the 
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utilization of computational resources in a parallel computa- 
tion. These metrics are calculated using Eq. (13). 

Table 3 compares the computation effort and parallel per- 
formance using 1, 4, 8, 12, 18, 25, and 36 processors. The 
computation time, speedup and efficiency for response matrix 
formation (referred to as formation time/speedup/efficiency) 
and solution (referred to as solution time/speedup/efficiency) 
are provided. All parallel computations are performed using 
a single compute node to mitigate the impact of inter-node 
communication on parallel efficiency. Based on Table 3, it 
is evident that as the number of participating processors in 
parallel computation increases, the computation time signifi- 
cantly decreases, leading to an increase in speedup. The over- 
all speedup with 36 processors exceeds 24.0. This demon- 
strates the effectiveness of parallelization in reducing the to- 
tal computational time. Furthermore, increasing the number 
of processors generally results in a decrease in parallel ef- 
ficiency. When the number of processors increases from 4 
to 36, the formation efficiency, solution efficiency, and over- 
all efficiency decrease from 88.92%, 93.31%, and 93.31% to 
43.83%, 70.50%, and 68.51%, respectively. This decrease is 
primarily attributed to the growing proportion of communica- 
tion overhead compared to local work. 


Table 3. Comparison of computation effort and parallel performance 
for the KAIST problem. 


Number of processors 1 4 8 12 18 25 36 

Formation time (h) 0.79 0.22 0.12 0.10 0.09 0.07 0.05 
Solution time (h) 15.80 4.22 2.30 1.66 1.20 0.80 0.62 
Total time (h) 16.64 4.45 2.42 1.76 1.29 0.87 0.67 
Formation speedup / 3.56 6.48 8.11 8.38 10.74 15.78 
Solution speedup / 3.74 6.87 9.49 13.21 19.80 25.38 
Overall speedup / 3.73 6.85 9.42 12.85 19.03 24.66 
Formation efficiency (%) / 88.92 80.95 67.61 46.53 42.95 43.83 
Solution efficiency (%) / 93.31 85.91 79.10 73.38 79.19 70.50 
Overall efficiency (%) / 93.31 85.65 78.46 71.41 76.12 68.51 


Regarding matrix formation, the workload assigned to each 
processor decreases as the number of processors increases be- 
cause of the reduced number of local matrix sets. However, 
the communication overhead required to construct the global 
matrix sets becomes more significant with an increased num- 
ber of processors, resulting in a larger portion of time spent 
on communication surpasses the time saved by distributing 
the workload across multiple processors. Consequently, the 
efficiency for matrix formation drops from nearly 90% with 4 
processors to only 43.83% with 36 processors. Additionally, 
the workload imbalance caused by the uneven distribution of 
matrix sets can also impact parallel efficiency. This workload 
imbalance may become more pronounced as the number of 
processors increases, further decreasing parallel efficiency. 

During the solution process, the local workload can be 
measured by the number of local nodes, while communica- 
tion overhead is associated with the number of communi- 
cation boundaries between subdomains. Table 4 presents a 
comparison of communication and local work in the solution 
process, including the maximum and minimum numbers of 
local nodes and communication boundaries. A significant de- 
crease in the local nodes is observed as the number of proces- 


56! 


D 


56 


a 


568 
56! 


© 


570 


57 


572 
573 
574 


57! 


a 


571 


a 


577 


578 


579 


581 


=] 


58 


582 


583 


584 


587 


609 


sors increases, while the number of communication bound- 
aries increases. This leads to a decrease in the ratio of local 
work to communication efforts. For instance, as the number 
of local nodes decreases from 18,490/17,640 to 2,250/1,960, 
the number of communication boundaries increases from 2/2 
to 4/2. Furthermore, as mentioned in Section IIB 2, subdo- 
mains situated in the middle of the problem generally exhibit 
a larger number of communication boundaries compared to 
those on the surfaces. This communication imbalance fur- 
ther reduces efficiency. The extent of communication imbal- 
ance can be estimated by calculating the relative difference 
between the maximum number of communication boundaries 
and the minimum number of communication boundaries. As 
indicated in Table 4, the number of communication bound- 
aries is 2/2 with 4 processors, but it is 4/2 with 36 processors. 
Therefore, the communication imbalance may become more 
severe as the number of processors increases. 


Table 4. Comparison of communication and local work for the 
KAIST problem. 


Number of Subdomain Max/Min Max/Min 
processors distribution local nodes communication boundaries 

4 (2,2,1)  18,490/17,640 2/2 

8 (2,2,2) 9,245/8,820 3/3 

12 (2,3,2) 6,235/5,880 4/3 

18 (3,3,2) 4,206/3,920 5/3 

25 (5,5,1) 2,890/2,890 4/2 

36 (6,6,1) 2,250/1,960 4/2 


Fig. 8 depicts a visualized representation of the parallel 
performance with varying numbers of processors. Fig. 8 
illustrates that the overall parallel performance is predomi- 
nantly influenced by the parallel performance of the solution 
phase. This is due to the relatively insignificant contribution 
of the formation time compared to the solution time. For in- 
stance, considering the results obtained using a single proces- 
sor, the ratio of solution time to formation time is 20.0. This 
implies that the response matrix formation is more susceptible 
to the escalating communication overhead resulting from in- 
creased processors, compared to the solution phase. As illus- 
trated in Fig. 8(a), the solution speedup exhibits a nearly lin- 
ear growth trend as the number of processors increases, while 
the formation speedup progresses at a relatively slower pace. 
This discrepancy becomes particularly noticeable when the 
number of processors rises from 12 to 18, where the forma- 
tion speedup remains almost unchanged. Conversely, the ef- 
ficiency of matrix formation experiences a more pronounced 
decline with an increasing number of processors compared to 
the solution efficiency, as depicted in Fig. 8(b). 

It is worth noting, as observed from Fig. 8(b), that the so- 
lution efficiency actually increases when the number of pro- 
cessors transitions from 18 to 25. This improvement in ef- 
ficiency is likely attributed to load balancing. As shown in 
Table 4, when utilizing 18 processors, a significant load im- 
balance exists, with a maximum/minimum number of local 
nodes of 4206/3920. This imbalance leads to some processors 
being underutilized while others are overloaded. However, 
when the number of processors is adjusted to 25, each pro- 
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(a) Speedup (b) Efficiency 
Fig. 8. (Color online) Parallel performance versus number of MPI 
professors for the KAIST problem. 


cessor is assigned subdomain with an equal number of local 
nodes. The equal distribution of local nodes among proces- 
sors promotes load balance in the solution process, ultimately 
contributing to enhanced parallel efficiency. 


B. JRR-3 problems 


We model the JRR-3 problem to demonstrate the appli- 
cability of the parallel algorithm to a spatial domain with 
a more complex geometry structure. This problem is con- 
structed based on the Japan Research Reactor No.3 (JRR- 
3) [33, 34] designed by Japan Atomic Energy Research Insti- 
tute (JAERI). JRR-3 is a water-cooled research reactor using 
plate-type fuels. The geometric representation of the JRR-3 
reactor is illustrated in Fig. 9. The reactor core is composed 
of 26 standard fuel assemblies, 6 follow fuel assemblies with 
neutron absorber and 5 glory hole assemblies. Surrounding 
the core is a baffle with a thickness of 1 cm and an inner ra- 
dius of 30.0 cm. Furthermore, there is a 30 cm axial reflector 
located at the top and bottom of the reactor. The lateral geom- 
etry of typical assemblies is illustrated in Fig. 10. All assem- 
blies have dimensions of 7.72 cm x 7.72 cm. The standard 
fuel assembly comprises 20 evenly arranged fuel plates, each 
with a thickness of 0.076 cm and a length of 6.16 cm. The fol- 
low fuel assembly consists of 16 fuel plates, also with a thick- 
ness of 0.076 cm, but a shorter length of 4.9 cm. The absorber 
assembly incorporates an absorber material with a thickness 
of 0.5 cm. Further detailed parameters of assemblies can be 
found in Ref. [34]. The calculations employ seven group 
macroscopic cross-sections, which are provided in Ref. [34]. 
The reference solutions for all the cases in this problem were 
obtained from the MC code RMC [35-37]. In the MC cal- 
culation using RMC, a simulation of 10 million particles per 
batch were performed, with a total of 800 batches, of which 
300 batches were skipped. The statistical deviation of the 
eigenvalue was | pcm. 


1. Accuracy comparison 


(1) Assembly cases 

Adhering to a progressive approach, we initially perform 
calculations for 2D fuel assemblies. We divide the standard 
fuel assembly into nodes of 7 x 20, while dividing the follow 
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Fig. 9. (Color online) The geometric diagram of the JRR-3 problem. 
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(b) Follow fuel assembly 
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Fig. 10. (Color online) The lateral geometry of typical assemblies in 
the JRR-3 problem. 


e3 fuel assembly into nodes of 6 x 18, to facilitate the compari- 
654 son Of the fission rates of fuel plates, as illustrated in Fig. 11. 
ess Given the intricate composition of assemblies in the JRR-3 
problem, a more refined finite element grid is necessitated to 
7 accurately represent the geometry within each unique node, 
in contrast to the grids employed in the KAIST problem. The 
size of the finite element grids is even smaller than 0.05 cm, 
as illustrated in Fig. 12. During the calculation, P1; 3 spheri- 
cal harmonics and 2" order polynomials are employed on the 
lateral interfaces, while retaining the remaining calculation 
e3 parameters identical to those employed in the KAIST prob- 
e4 lem. Concerning the parallel calculation, the fuel assembly 
ees is decomposed into 2 x 8 subdomains, with each subdomain 
ees assigned to an individual processor. 
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(a) Standard fuel assembly 


(b) Follow fuel assembly 


Fig. 11. (Color online) The node division scheme for fuel assemblies 
in the 2D assembly cases of the JRR-3 problem. 
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Fig. 12. (Color online) The finite element grids for a follow fuel 
assembly in the 2D assembly cases of the JRR-3 problem. 


671 low fuel assembly. It can be observed that using 16 proces- 
672 sors for computation yields results that closely align with the 
673 reference results. The eigenvalue error is below 50 pcm, and 
674 the RMS of plate fission rate percent error is less than 0.1%. 
67s This not only demonstrates the feasibility of HVNM in deal- 
676 ing with plate-type fuel assemblies but also confirms the cor- 
677 rectness of the parallel algorithm. 


Table 5. Comparison of results for the 2D assembly cases of the 
JRR-3 problem. 


Assembly type Standard fuel Follow fuel 
assembly assembly 
Number of processors 16 16 
Eigenvalue 1.43143 1.32202 
Eigenvalue error (pcm) 22 44 
RMS 0.04 0.07 
Fission rate error (%) MAX 0.12 0.20 
AVG 0.03 0.05 


Fig. 13 illustrates the fission rate distribution of the fuel 
plate for the standard fuel assembly and the follow fuel as- 
sembly. In the standard fuel plates, the fission rates are ho- 
mogenized into 5 sections, while in the follow fuel plates, 
they are homogenized into 4 sections. It can be found that the 
e83 fuel plate segments located near the periphery of the assem- 
bly exhibit higher fission rates compared to the ones located 
at the center of the assembly. 

(2) Whole-core cases 

In the whole-core calculation, the entire reactor is divided 
into 9 x 9 assemblies, with each assembly further subdivided 
into 7 x 20 nodes. Both 2D and 3D whole-core cases are 
examined. In the 3D calculation, the axial direction is divided 
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Fig. 13. (Color online) Normalized fission rate distribution of the 
fuel assemblies for the 2D assembly cases of the JRR-3 problem. 


into 45 layers, each with a height of 3 cm. The spatial and 
angular expansion schemes for the whole-core calculation are 
listed in Table 6. All other calculation parameters remain the 
same as those used in the KAIST problem. 
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Table 6. Spatial and angular expansion scheme for the whole-core 
case of the JRR-3 problem. 


Value 


Calculation parameters 


whole core 


2D 3D 


whole core 


Volume spatial expansion in x-y 
Surface spatial expansion in x/y 
Volume angular integrals 


Quadrilateral finite elements 
2" order polynomials 
25 x 25 SLC cubature 


Py order on the lateral interfaces Pii3 

Volume spatial expansion in z / 2" order polynomials 
Surface spatial expansion in z / 1 piecewise constant 
Py order on the axial interfaces Ps 3 


695 


69i 


© 


69 


ef 


698 
69 


© 


70l 


(=) 


Table 7 presents the comparison results for the eigen- 
value and axially integrated assembly fission rates. In the 
2D and 3D whole-core cases, the eigenvalue errors are -56 
pcm/-90 pcm, and the RMS of fission rate percent error are 
0.54%/0.65%, respectively. These results indicate the high 
accuracy achieved through the parallelization of HVNM. 


Table 7. Comparison results for the whole-core cases of the JRR-3 


problem. 

Case 2D whole-core 3D whole-core 

Number of processors 80 88 

Eigenvalue 0.92157 0.88133 

Eigenvalue error (pcm) -56 -90 
RMS 0.54 0.65 

Fission rate error (%) MAX 1.42 1.72 
AVG 0.37 0.47 


70 


Fig. 14 illustrates the normalized fission rate distribution 
of the assemblies, excluding the assemblies in the reflector 
region. It can be observed that the fuel assemblies positioned 
at the central region of the core display higher fission rates 
compared to the assemblies located at the periphery of the 
core. Fig. 15 illustrates the error distribution of the assembly 
fission rates, with an error range of -0.94% to 1.72%. The 
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maximum error is observed in the assembly near the reflector 
region. 


(b) 3D whole-core 


(a) 2D whole-core 


Fig. 14. (Color online) Normalized fission rate distribution of the 
whole-core cases in the JRR-3 problem. 
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Fig. 15. (Color online) Fission rate error distribution of the whole- 
core cases in the JRR-3 problem. 


2. Parallel performance analysis 


The 3D whole-core case is more capable of demonstrat- 
ing the superiority of the parallel algorithm due to the sig- 
nificantly larger computational workload compared to the as- 
sembly cases and 2D whole-core cases; this is evident from 
the results presented in Table ??. As shown in Table ??, the 
total number of spatial-angular degrees of freedom for the 3D 
whole-core case exceeds 10 million, posing a significant chal- 
lenge for the computational resources and indicating the ne- 
cessity of employing a parallel algorithm for 3D whole-core 
calculations. Consequently, this section focuses on analyzing 
the performance of the parallel algorithm specifically for the 
3D whole-core case. Furthermore, considering the computa- 
tional memory and time constraints associated with perform- 
ing 3D whole-core calculations using a single processor, the 
results with 36 processors are taken as the baseline for evalu- 
ating the parallel performance. Accordingly, the definition of 
speedup and efficiency is adjusted to 
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where T} represents run time with 36 processors. 

The comparative results regarding the computational effort 
and parallel performance, employing 36, 60, and 144 proces- 
sors, are presented in Table 8. In all cases, the calculations are 
performed using an equal number of compute nodes, specif- 
ically 12, to minimize the impact of inter-node communica- 
tion on parallel performance. It is observed that the paral- 
lel algorithm demonstrates efficient acceleration and parallel 
efficiency. Additionally, the parallel performance exhibits a 
similar trend to that observed in the KAIST problem as the 
number of processors increases. The total computation time 
decreases from 1.24 h using 36 processors to 0.40 h using 144 
processors, yielding a speedup of 3.09. The overall efficiency 
stands at 95.42% and 77.14% using 60 and 144 processors, 
respectively. The underlying reasons for this trend have been 
extensively discussed in Section III A 2 and will not be reit- 
erated here. However, in contrast to the KAIST problem, the 
JRR-3 problem has a lower proportion of solution time. For 
instance, with 36 processors, the formation time accounts for 
93% of the total time in the KAIST problem, while in the 
JRR-3 problem, it constitutes only 65%. Consequently, the 
solution efficiency of the JRR-3 problem exerts a compara- 
tively less dominant influence on the overall efficiency when 
compared to the KAIST problem. 


Table 8. Comparison of computation effort and parallel performance 
for the 3D whole-core case of the JRR-3 problem. 


Number of processors 36 60 144 
Formation time (h) 0.44 0.29 0.16 
Solution time (h) 0.80 0.49 0.24 
Total time (h) 1.24 0.78 0.40 
Formation speedup / 1.52 2:13 
Solution speedup / 1.63 3.33 
Overall speedup / 1.59 3.09 
Formation efficiency (%) / 91.47 68.16 
Solution efficiency (%) / 97.74 83.13 
Overall efficiency (%) / 95.42 77.14 


IV. CONCLUSIONS 


In this paper, we introduce an efficient parallel algorithm 
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